Unknown Word Guessing and Part-of-Speech Tagging Using Support Vector Machines

Authors

  • Tetsuji Nakagawa
  • Taku Kudo
  • Yuji Matsumoto
Abstract

The accuracy of part-of-speech (POS) tagging for unknown words is substantially lower than that for known words. Considering the high accuracy rate of up-to-date statistical POS taggers, unknown words account for a non-negligible portion of the errors. This paper describes POS prediction for unknown words using Support Vector Machines. We achieve high accuracy in POS tag prediction using substrings and surrounding context as the features. Furthermore, we integrate this method with a practical English POS tagger, and achieve accuracy of 97.1%, higher than conventional approaches.
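The abstract names substrings and surrounding context as the features for predicting the tag of an unknown word. The following is a minimal sketch of what such a feature extractor might look like, not the authors' implementation: the function name `extract_features`, the affix length limit of 4, the context window of 2, the character-type cues, and the `<PAD>` boundary marker are all assumptions made for illustration. The resulting feature sets would still have to be turned into vectors and passed to an SVM classifier, which is not shown here.

```python
def extract_features(tokens, index, max_affix_len=4, window=2):
    """Return a set of string features for the word at tokens[index].

    All parameter defaults are illustrative assumptions, not values
    taken from the paper.
    """
    word = tokens[index]
    features = set()

    # Substring features: every prefix and suffix up to max_affix_len characters.
    for n in range(1, min(max_affix_len, len(word)) + 1):
        features.add(f"prefix={word[:n]}")
        features.add(f"suffix={word[-n:]}")

    # Character-type cues (assumed; these often help with unknown words).
    if any(c.isupper() for c in word):
        features.add("has_upper")
    if any(c.isdigit() for c in word):
        features.add("has_digit")
    if "-" in word:
        features.add("has_hyphen")

    # Context features: neighbouring words inside the window, lowercased.
    # Positions outside the sentence are marked with a padding symbol.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = index + offset
        neighbor = tokens[j].lower() if 0 <= j < len(tokens) else "<PAD>"
        features.add(f"word[{offset:+d}]={neighbor}")

    return features


# Hypothetical usage: "zorbling" stands in for a word missing from the lexicon.
example_tokens = ["The", "zorbling", "cat", "sat"]
example_features = extract_features(example_tokens, 1)
```

Representing each feature as a string keeps the sketch independent of any particular SVM library; a sparse binary encoding of these strings is one common way to feed them to a linear or kernel classifier.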

Related resources

Multilingual Word Segmentation and Part-of-Speech Tagging: A Machine Learning Approach Incorporating Diverse Features

The aim of this dissertation is to study statistical methods for multilingual word segmentation and POS tagging with high accuracy. Word segmentation and part-of-speech (POS) tagging are fundamental language analysis tasks in natural language processing, and used in many applications. Existence of unknown words is a large problem in these tasks and they need to be properly handled. We attempt t...

Prediction of part of speech tags for Punjabi using support vector machines

Part-Of-Speech (POS) tagging is a task of assigning the appropriate POS or lexical category to each word in a natural language sentence. In this paper, we have worked on automated annotation of POS tags for Punjabi. We have collected a corpus of around 27,000 words, which included the text from various stories, essays, day-to-day conversations, poems etc., and divided these words into different...

High Speed Unknown Word Prediction Using Support Vector Machine for Chinese Text-to-Speech Systems

One of the most significant problems in POS (Part-of-Speech) tagging of Chinese texts is an identification of words in a sentence, since there is no blank to delimit the words. Because it is impossible to pre-register all the words in a dictionary, the problem of unknown words inevitably occurs during this process. Therefore, the unknown word problem has remarkable effects on the accuracy of th...

Automatic Rule Induction for Unknown-Word Guessing

Words unknown to the lexicon present a substantial problem to NLP modules that rely on morphosyntactic information, such as part-of-speech taggers or syntactic parsers. In this paper we present a technique for fully automatic acquisition of rules that guess possible part-of-speech tags for unknown words using their starting and ending segments. The learning is performed from a general-purpose l...

Unsupervised Learning of Word-Category Guessing Rules

Words unknown to the lexicon present a substantial problem to part-of-speech tagging. In this paper we present a technique for fully unsupervised statistical acquisition of rules which guess possible parts-of-speech for unknown words. Three complementary sets of word-guessing rules are induced from the lexicon and a raw corpus: prefix morphological rules, suffix morphological rules and ending-gu...

Publication date: 2001